AITopics | output sequence length

Increasing GPU Utilization during Generative Inference for Higher Throughput

Neural Information Processing Systems, Feb-10-2026, 10:57:48 GMT

Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. This problem is exacerbated in one of the current LLM serving frameworks, which reserves memory for the maximum sequence length in the KV cache to guarantee that a complete sequence can be generated, because it does not know the output sequence length. This restricts us to a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Diego County > Carlsbad (0.04)
Asia > Taiwan (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Asia > China (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput

Neural Information Processing Systems, Dec-24-2025, 16:52:03 GMT

Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. This problem is exacerbated in one of the current LLM serving frameworks, which reserves memory for the maximum sequence length in the KV cache to guarantee that a complete sequence can be generated, because it does not know the output sequence length. This restricts us to a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem. To this end, we propose S$^{3}$, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions. Our proposed method achieves 6.49$\times$ the throughput of systems that assume the worst case for the output sequence length.

generative inference, gpu utilization, sequence length, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.44)

Increasing GPU Utilization during Generative Inference for Higher Throughput

Neural Information Processing Systems, Oct-8-2025, 11:37:11 GMT

Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. This problem is exacerbated in one of the current LLM serving frameworks, which reserves memory for the maximum sequence length in the KV cache to guarantee that a complete sequence can be generated, because it does not know the output sequence length. This restricts us to a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem.

sequence, sequence length, throughput, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Diego County > Carlsbad (0.04)
Asia > Taiwan (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Asia > China (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput

Neural Information Processing Systems, Oct-11-2024, 07:55:36 GMT

Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. This problem is exacerbated in one of the current LLM serving frameworks, which reserves memory for the maximum sequence length in the KV cache to guarantee that a complete sequence can be generated, because it does not know the output sequence length. This restricts us to a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem. To this end, we propose S$^{3}$, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions.

generative inference, output sequence length, sequence length, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)

S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput

Jin, Yunho, Wu, Chun-Feng, Brooks, David, Wei, Gu-Yeon

arXiv.org Artificial Intelligence, Jun-9-2023

Generating text with a large language model (LLM) consumes massive amounts of memory. Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. This problem is exacerbated in one of the current LLM serving frameworks, which reserves memory for the maximum sequence length in the KV cache to guarantee that a complete sequence can be generated, because it does not know the output sequence length. This restricts us to a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence can mitigate this problem. To this end, we propose S$^{3}$, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions. Our proposed method achieves 6.49$\times$ the throughput of systems that assume the worst case for the output sequence length.
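The batch-size argument in the abstract can be illustrated with some back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the model shape (32 layers, hidden size 4096, 2-byte values), the 16 GiB cache budget, and the 2048 and 256 token reservations are all hypothetical figures chosen to keep the numbers round, and the helper names are made up for illustration. It only shows how the number of sequences that fit in memory changes when each sequence reserves a predicted length instead of the maximum length.

```python
def kv_bytes_per_token(n_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * n_layers * hidden_size * bytes_per_value


def max_batch_size(free_bytes: int, reserved_tokens: int, per_token: int) -> int:
    """Number of sequences that fit if each one reserves `reserved_tokens` cache slots."""
    return free_bytes // (reserved_tokens * per_token)


# Hypothetical model shape and memory budget (assumptions, not values from the paper).
per_token = kv_bytes_per_token(n_layers=32, hidden_size=4096)  # 524288 bytes = 0.5 MiB
free_bytes = 16 * 1024**3  # 16 GiB assumed to remain for the cache after loading weights

# Reserving the maximum sequence length for every query versus a predicted length.
worst_case = max_batch_size(free_bytes, reserved_tokens=2048, per_token=per_token)
predicted = max_batch_size(free_bytes, reserved_tokens=256, per_token=per_token)

print(worst_case, predicted)  # prints: 16 128
```

Under these assumed numbers the worst-case reservation admits 16 sequences per batch while the shorter reservation admits 128. A real system would also need to handle queries that run past their predicted length, which the abstract refers to as handling mispredictions.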

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2306.06

Country:

North America > United States > California > San Diego County > Carlsbad (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Asia > China (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Non-Autoregressive Transformer Automatic Speech Recognition

Chen, Nanxin, Watanabe, Shinji, Villalba, Jesús, Dehak, Najim

arXiv.org Machine Learning, Nov-10-2019

Recently, very deep transformers have started to outperform traditional bi-directional long short-term memory networks by a large margin. However, inference computation cost and latency remain serious concerns for production use in real scenarios. In this paper, we study a novel non-autoregressive transformer structure for speech recognition, which was originally introduced in machine translation. During training, input tokens fed to the decoder are randomly replaced by a special mask token. The network is required to predict those mask tokens by taking both context and input speech into consideration. During inference, we start from all mask tokens and the network gradually predicts all tokens based on partial results. We show this framework can support different decoding strategies, including traditional left-to-right. A new decoding strategy is proposed as an example, which proceeds from the easiest predictions to the most difficult ones. Some preliminary results on the Aishell and CSJ benchmarks show that it is possible to train such a non-autoregressive network for ASR. On Aishell in particular, the proposed method outperformed the Kaldi nnet3 and chain model setups and is quite close to the performance of the state-of-the-art end-to-end model.
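The inference procedure described above (start from all mask tokens, then fill in the easiest predictions first) can be sketched as a small loop. This is only an illustration under stated assumptions, not the authors' implementation: the output length is assumed to be known in advance, "easiest" is taken to mean highest model confidence, and `toy_predict` is a hypothetical stand-in that returns fixed guesses instead of running a speech-conditioned transformer.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"

# A predictor maps the current partial sequence to one (token, confidence) guess per position.
Predictor = Callable[[List[str]], List[Tuple[str, float]]]


def easy_first_decode(predict: Predictor, length: int, per_step: int = 1) -> Tuple[List[str], List[int]]:
    """Fill an all-mask sequence, committing the most confident masked positions each step."""
    tokens = [MASK] * length
    order: List[int] = []  # positions in the order they were committed
    while MASK in tokens:
        guesses = predict(tokens)
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # Highest confidence first; only positions that are still masked compete.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = guesses[i][0]
            order.append(i)
    return tokens, order


def toy_predict(tokens: List[str]) -> List[Tuple[str, float]]:
    """Hypothetical predictor with fixed guesses; a real model would condition on `tokens` and audio."""
    return [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]


decoded, order = easy_first_decode(toy_predict, length=4, per_step=2)
print(decoded, order)  # prints: ['a', 'b', 'c', 'd'] [1, 3, 2, 0]
```

Replacing the confidence sort with a plain left-to-right ordering of the masked positions would give the traditional decoding order within the same loop, which is one way to read the claim that the framework supports different decoding strategies.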

iteration, prediction, sequence length, (12 more...)

arXiv.org Machine Learning

1911.04908

Country:

North America > United States > Maryland > Baltimore (0.04)
Europe > Austria > Styria > Graz (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
